<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>Tokenizer reference | ElasticSearch 7.7 权威指南中文版</title>
	<meta name="keywords" content="ElasticSearch 权威指南中文版, elasticsearch 7, es7, 实时数据分析，实时数据检索" />
    <meta name="description" content="ElasticSearch 权威指南中文版, elasticsearch 7, es7, 实时数据分析，实时数据检索" />
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
	<script>
	var _link = 'analysis-tokenizers.html';
    </script>
</head>
<body>
<div class="main-container">
    <section id="content">
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-tokenizers.html" rel="nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-tokenizers.html</a>, 原文档版权归 www.elastic.co 所有<br/>本地英文版地址: <a href="../en/analysis-tokenizers.html" rel="nofollow" target="_blank">../en/analysis-tokenizers.html</a></div>
                        <!-- start body -->
                  <div class="page_header">
<strong>重要</strong>: 此版本不会发布额外的bug修复或文档更新。最新信息请参考 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" rel="nofollow">当前版本文档</a>。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch Guide [7.7]</a></span>
»
<span class="breadcrumb-link"><a href="analysis.html">Text analysis</a></span>
»
<span class="breadcrumb-node">Tokenizer reference</span>
</div>
<div class="navheader">
<span class="prev">
<a href="analysis-whitespace-analyzer.html">« Whitespace Analyzer</a>
</span>
<span class="next">
<a href="analysis-chargroup-tokenizer.html">Char Group Tokenizer »</a>
</span>
</div>
<div class="chapter">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="analysis-tokenizers"></a>Tokenizer reference<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers.asciidoc">edit</a>
</h2>
</div></div></div>
<p>A <em>tokenizer</em>  receives a stream of characters, breaks it up into individual
<em>tokens</em> (usually individual words), and outputs a stream of <em>tokens</em>. For
instance, a <a class="xref" href="analysis-whitespace-tokenizer.html" title="Whitespace Tokenizer"><code class="literal">whitespace</code></a> tokenizer breaks
text into tokens whenever it sees any whitespace.  It would convert the text
<code class="literal">"Quick brown fox!"</code> into the terms <code class="literal">[Quick, brown, fox!]</code>.</p>
<p>The tokenizer is also responsible for recording the following:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
Order or <em>position</em> of each term (used for phrase and word proximity queries)
</li>
<li class="listitem">
Start and end <em>character offsets</em> of the original word which the term
represents (used for highlighting search snippets).
</li>
<li class="listitem">
<em>Token type</em>, a classification of each term produced, such as <code class="literal">&lt;ALPHANUM&gt;</code>,
<code class="literal">&lt;HANGUL&gt;</code>, or <code class="literal">&lt;NUM&gt;</code>. Simpler analyzers only produce the <code class="literal">word</code> token type.
</li>
</ul>
</div>
<p>Elasticsearch has a number of built in tokenizers which can be used to build
<a class="xref" href="analysis-custom-analyzer.html" title="Create a custom analyzer">custom analyzers</a>.</p>
<h3>
<a id="_word_oriented_tokenizers"></a>Word Oriented Tokenizers<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers.asciidoc">edit</a>
</h3>
<p>The following tokenizers are usually used for tokenizing full text into
individual words:</p>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<a class="xref" href="analysis-standard-tokenizer.html" title="Standard Tokenizer">Standard Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">standard</code> tokenizer divides text into terms on word boundaries, as
defined by the Unicode Text Segmentation algorithm. It removes most
punctuation symbols. It is the best choice for most languages.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-letter-tokenizer.html" title="Letter Tokenizer">Letter Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">letter</code> tokenizer divides text into terms whenever it encounters a
character which is not a letter.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-lowercase-tokenizer.html" title="Lowercase Tokenizer">Lowercase Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">lowercase</code> tokenizer, like the <code class="literal">letter</code> tokenizer,  divides text into
terms whenever it encounters a character which is not a letter, but it also
lowercases all terms.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-whitespace-tokenizer.html" title="Whitespace Tokenizer">Whitespace Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">whitespace</code> tokenizer divides text into terms whenever it encounters any
whitespace character.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-uaxurlemail-tokenizer.html" title="UAX URL Email Tokenizer">UAX URL Email Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">uax_url_email</code> tokenizer is like the <code class="literal">standard</code> tokenizer except that it
recognises URLs and email addresses as single tokens.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-classic-tokenizer.html" title="Classic Tokenizer">Classic Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">classic</code> tokenizer is a grammar based tokenizer for the English Language.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-thai-tokenizer.html" title="Thai Tokenizer">Thai Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">thai</code> tokenizer segments Thai text into words.
</dd>
</dl>
</div>
<h3>
<a id="_partial_word_tokenizers"></a>Partial Word Tokenizers<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers.asciidoc">edit</a>
</h3>
<p>These tokenizers break up text or words into small fragments, for partial word
matching:</p>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<a class="xref" href="analysis-ngram-tokenizer.html" title="N-gram tokenizer">N-Gram Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">ngram</code> tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word: a sliding window of continuous letters, e.g. <code class="literal">quick</code> →
<code class="literal">[qu, ui, ic, ck]</code>.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-edgengram-tokenizer.html" title="Edge n-gram tokenizer">Edge N-Gram Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">edge_ngram</code> tokenizer can break up text into words when it encounters any of
a list of specified characters (e.g. whitespace or punctuation), then it returns
n-grams of each word which are anchored to the start of the word, e.g. <code class="literal">quick</code> →
<code class="literal">[q, qu, qui, quic, quick]</code>.
</dd>
</dl>
</div>
<h3>
<a id="_structured_text_tokenizers"></a>Structured Text Tokenizers<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elastic/elasticsearch/edit/7.7/docs/reference/analysis/tokenizers.asciidoc">edit</a>
</h3>
<p>The following tokenizers are usually used with structured text like
identifiers, email addresses, zip codes, and paths, rather than with full
text:</p>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<a class="xref" href="analysis-keyword-tokenizer.html" title="Keyword Tokenizer">Keyword Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">keyword</code> tokenizer is a “noop” tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term.  It can be combined
with token filters like <a class="xref" href="analysis-lowercase-tokenfilter.html" title="Lowercase token filter"><code class="literal">lowercase</code></a> to
normalise the analysed terms.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-pattern-tokenizer.html" title="Pattern Tokenizer">Pattern Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">pattern</code> tokenizer uses a regular expression to either split text into
terms whenever it matches a word separator, or to capture matching text as
terms.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-simplepattern-tokenizer.html" title="Simple Pattern Tokenizer">Simple Pattern Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">simple_pattern</code> tokenizer uses a regular expression to capture matching
text as terms. It uses a restricted subset of regular expression features
and is generally faster than the <code class="literal">pattern</code> tokenizer.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-chargroup-tokenizer.html" title="Char Group Tokenizer">Char Group Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">char_group</code> tokenizer is configurable through sets of characters to split
on, which is usually less expensive than running regular expressions.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-simplepatternsplit-tokenizer.html" title="Simple Pattern Split Tokenizer">Simple Pattern Split Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">simple_pattern_split</code> tokenizer uses the same restricted regular expression
subset as the <code class="literal">simple_pattern</code> tokenizer, but splits the input at matches rather
than returning the matches as terms.
</dd>
<dt>
<span class="term">
<a class="xref" href="analysis-pathhierarchy-tokenizer.html" title="Path Hierarchy Tokenizer">Path Tokenizer</a>
</span>
</dt>
<dd>
The <code class="literal">path_hierarchy</code> tokenizer takes a hierarchical value like a filesystem
path, splits on the path separator, and emits a term for each component in the
tree, e.g. <code class="literal">/foo/bar/baz</code> → <code class="literal">[/foo, /foo/bar, /foo/bar/baz ]</code>.
</dd>
</dl>
</div>
















</div>
<div class="navfooter">
<span class="prev">
<a href="analysis-whitespace-analyzer.html">« Whitespace Analyzer</a>
</span>
<span class="next">
<a href="analysis-chargroup-tokenizer.html">Char Group Tokenizer »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>